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In analogy to Neuberger's double-pass algorithm for the Conjugate Gradient inversion with multi- 
shifts we introduce a double-pass variant for BiCGstab(f). One possible application is the overlap 
operator of QCD at non-zero chemical potential, where the kernel of the sign function is non- 
Hermitian. The sign function can be replaced by a partial fraction expansion, requiring multi- 
shift inversions. We compare the performance of the new method with other available algorithms, 
namely partial fraction expansions with restarted FOM inversions and the Krylov-Ritz method 
using nested Krylov subspaces. 
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1. Introduction and Motivation 

In this contribution we present double-pass variants for the multi-shift inverter BiCGstab(f) 1 
which, in some cases, can perform better than the conventional single -pass. The method is an ana- 
logue to Neuberger's double -pass Conjugate Gradient (CG) method [1, 2]. The use of BiCGstab(f) 
instead of CG can be a speed advantage (for Hermitian matrices) or necessary (for non-Hermitian 
matrices). One possible application is the computation of quark propagators for a set of distinct 
masses. Here, however, we focus on computing the overlap operator of QCD. At non-zero quark 
chemical potential, /I ^ 0, it is defined as 

D ov (n) = l + 7 5 sign(7 5 D v „( i u)), (1.1) 

where D w (fi) is the (Wilson) Dirac operator with chemical potential. For fi / the matrix y$D w (ii) 
is non-Hermitian. One way to compute the sign function of such a matrix, acting on a given vector 
b, is via a partial fraction expansion (PFE), 

where we are especially interested in the case of A = (ysD w ) 2 with f(A) = l/VA, since signz = 
z/v 7 ?- The vectors (A + c s )~ l b for a set of shifts {a 4 } can be approximated by iterative inverters 
which find solutions in a Krylov subspace, defined as 

Jf k {A,b) = span(b,Ab, . . . ,A k - l b). (1.3) 

A crucial feature of Krylov subspaces is their shift invariance, + o s ,b) = <%k(A,b), which 

allows for so called multi-shift inversions, where one Krylov subspace suffices to compute (A + 
o s )~ Y b for a set {o s } with little overhead per additional shift. We will refer to methods employing 
Eq. (1.2) as PFE methods. 

2. Double-pass algorithm 

As a stalling point, we consider established algorithms to compute the sign function of a non- 
Hermitian matrix, (i) the Krylov-Ritz method with nested Krylov subspaces, introduced in [3], (ii) 
PFEs with FOM inversions, introduced in [4] and (iii) PFEs with BiCGstab(^) as inverter. The latter 
has so far not been considered in the context of the sign function. For details on the BiCGstab(f) 
method see [5], a version with shifts was introduced in [6]. Benchmark results are given in Fig. 1. 
The nested Krylov-Ritz algorithm outperforms both PFE methods, which is somewhat surprising 
since all rely on a similar Krylov subspace. 2 We can gain more insight by analysing the bad 
performance of BiCGstab(f) in this case: 

• Denote by N s the number of shifts and by M s the number of outer iterations of the BiCGstab(^) 
algorithm until the system with shift o s is converged. Then, the multi-shift version has 
E^Li M s l(0.5l + 4.5) "axpy" operations (y <s— ax +y, for scalar a and vectors x and y) more 
than the BiCGstab(£) algorithm without shifts. 

I is the degree of the Minimal Residual polynomial in the algorithm 

2 Note however that the employed single-pass (nested) Krylov-Ritz method requires a huge amount of memory. 
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Figure 1: Accuracy vs. computation time for the overlap operator on a 4 3 x 8 lattice, j3 = 5.32, fx = 0.05 
with the Neuberger PFE. The number of poles N s is chosen minimal for the desired accuracy as in [4]. In 
this plot it ranges from 10 (accuracy 0.01) to 58 (accuracy 10~ 10 ). The 4 eigenvalues smallest in magnitude 
are deflated in advance. 

• BiCGstab(f) requires 21 + 5 vectors and the multi-shift version has N s (l + 1) additional shift 
vectors, where typically N s = 0(10). These figures should be seen in relation to the typical 
cache size of current processors (iff (I MByte)) and the size of a vector, e.g., 50 kByte (local 
volume 4 4 ) or 800 kByte (local volume 8 4 ). In a typical case not all shift vectors fit into 
cache and the access to main memory can become the bottleneck of the algorithm. 

To tackle these performance restraints one can try a double-pass approach in analogy to Neu- 
berger's double-pass algorithm for a multi-shift CG inversion. Schematically the idea is as follows: 
The quantity computed in Eq. (1.2) and approximated in a Krylov subspace is 

N s N s N 

£ <o s (A + a s )- l b « £ <o s £ Wsi (2.1) 

s= 1 s=\ n=\ 

(n) 

where N is the number of iterations in the inverter and vv, is a vector for shift s in iteration n. To 
remove s dependent vectors one could try to swap the sums over s and n, however Wj is given by 
a recursion relation, 

W W = a^wt l) +/3i' l) vW = £ $v W , (2.2) 

(=i 

where is an unshifted iteration vector. In the last step the recursion of the vectors wi was 
resolved. By combining Eqs. (2.1) and (2.2) and summing over s (and n), all vectors depending on 

(n) 

s are removed from the algorithm. However, the coefficients /; = J^ s n / are not known until the 
end of the iteration. There are two options 

1. (double -pass): Follow Neuberger's approach by running the algorithm once to obtain In 
a second pass generate the vectors again and compute Ji v ■ 
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Figure 2: Computation time vs. N s for fixed Ml = 256 for a 4 3 x 8 lattice (Wilson Dirac operator) for all 
three BiCGstab(f) variants. Results are given for I = 1,2,4,8 with lines solid to dotted. 

2. (pseudo-double-pass): Compute the coefficients Ji as in double-pass, but store all vW during 
the first pass instead of recomputing them in a second pass. 

Both methods remove all ^-dependent vectors from the algorithm, hence reducing the number of 
operations and number of vectors to be held in cache. In our case, to obtain the coefficients cor- 
responding to the (schematic) coefficients yu the recursion has to be solved for the BiCGstab(f) 
algorithm. The result is given in Sec. 4. 

3. Cost analysis and benchmarks 

The number of operations (scalar ones are omitted) and vectors of the multi-shift BiCGstab(f) 
algorithms are given in the following table (M denotes the number of outer iterations of the algo- 
rithm and the dimension of the Krylov space is 2MI): 



Method 


#Mv 


#axpy 


#dot-products 


#vectors 


1-pass 


2MI 


M/(1.5/- 


1-5.5) 


4-1^^/(0.5/+ 4.5) 


M/(0.5/ + 3.5) 


2/ + 5+iV,(/ + l) 


2-pass 


AMI 


Ml{\.5l- 


h5.5) 


■f Ml(l.5l+4.5)+2Ml 


MZ(0.5/ + 3.5) 


2Z + 5 


pseudo-2-pass 


2MI 


Ml(1.5l- 


1-5.5) 


4-2MZ 


M/(0.5/ + 3.5) 


2/ + 5 + 2M/ 



The number of vectors alone is not always meaningful: In single-pass a considerable subset 3 of the 
N s (l +1) vectors is accessed in each iteration of the algorithm. In pseudo-double-pass each of the 
2MI vectors is written and read exactly once, in total. That is, the access pattern of pseudo-double- 
pass requires less memory access than single-pass, even though many more vectors are involved. 

As a naive test of the figures given in the table we consider the algorithm runtime for fixed 
Ml with a varying number of shifts, given in Fig. 2. The two-pass and pseudo-two-pass runtime 
is largely independent of N„. The pure operation count of two-pass would yield an almost doubled 
computation time compared to pseudo-two-pass. In practice, however, it is less since no (or less) 

3 depending on £, and on the removal of converged systems from the iteration 
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Figure 3: Accuracy vs. computation time for the overlap operator (4 3 x 8 lattice with j3 = 5.32, ji = 0.05). 
The timings are averaged over 200 independent gauge configurations. Note that an extreme case with little 
deflation (4 eigenvalues smallest in magnitude) was chosen for this plot where many poles are required (as 
before N s is scaled from 10 to 58), yielding a large speed advantage of the double-pass algorithms. When less 
poles are needed (e.g., for small jj, when the Zolotarev PFE can be used instead of the Neuberger expansion) 
the performance difference is often smaller. 



main memory access is required. An effect of the cache size can be seen from the single -pass I = 1 
curve. The slope changes in the vicinity of N s = 10, which is consistent with the cache size of 
4 MByte and the size of a vector of 100 kByte. As a further observation, the relative performance 
loss for large £ is much smaller in the double-pass methods compared to single-pass, which is not 
surprising since the number of required vectors increases with £. 4 

As a more realistic benchmark we compute the overlap operator for given configurations in 
Fig. 3. The double-pass and pseudo-double-pass BiCGstab(f) algorithms perform as well or even 
better than the nested double-pass and single-pass algorithms, respectively. Note that also the 
respective memory requirements are similar. In double-pass the performance does not degrade for 
/ > 1 as it does for single pass. To explore differences between the nested Krylov-Ritz method 
and the BiCGstab(^) methods a series of benchmarks was performed, where both the number of 
deflated eigenvectors and the chemical potential pL were varied. The tests indicate that BiCGstab(f) 
profits more from deflation than the Krylov-Ritz method does. On the other hand, for large /i, 
BiCGstab(£) tends to stagnate earlier than Krylov-Ritz. 

Finally we give results of a larger-scale simulation in Fig. 4. As before, pseudo-double-pass 
BiCGstab(£) is the algorithm performing best. The optimal £ depends on the number of cores iV c . 
Due to memory limitations for small N c and network limitations for large N c , there is an optimal 
N c minimizing the total CPU time. 

4. Algorithm details 

We follow the notation in Ref . [6] where also a listing of BiCGstab(f) is given. Upper indices 
4 Since we work at fixed Ml this plot does not tell which ( is optimal, since the convergence rate depends on (... 
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Figure 4: Total computation time vs. number of cores for the nested Krylov-Ritz method and pseudo- 
double-pass BiCGstab(f) for / = 1 and 8. The simulation uses a 12 3 x 24 lattice, j3 = 5.71, = 0.016667 
with 44 deflated eigenvalues. The accuracy is 10~ 10 , obtained with N s = 16 shifts. The dimension of the 
Krylov subspace is about 2000 for BiCGstab(^ = 1,8) as well as for nested Krylov-Ritz. The benchmarks 
where done on an Opteron 2354 Cluster (2.2 GHz, 2 quad-core processors per node, 16 GByte RAM per 
node, Infiniband network). Lines are drawn to guide the eye. 



mj denote iteration j of BiCG part and outer iteration m. Define the coefficients 



b: 



M-l [ i-1 

: E IWIIK-J 85 

n=m+\ 1^7=0 



Dt 



1 



i—i 



' n-\ ( I —y k \ l-l 

n (hA-m-^ iw* 

J=m+1 \j=0 \ V I J k'=0 



(4.1) 

(4.2) 



FL,= 



n (-B s ) mk 



7 * = ^L 

"mj (ysy 
1 



mj (a s ) m J(& s (p s ) m j , mj (a s ) mJ (& s <P£ ev ,) mJ ' 
where 7o = — 1, and a matrix 



\'=j+l ) k=j+\ 

(4.4) 



y-i 



(M m )jk = - e a ' mi n (-F np ) 

q=k \p=k+\ , 



j,k / 1. 



Then the contribution to £ $ co s x s from the BiCG part is given by 

A mp ((/¥ m )v+E (e°>xa„ ) {{M m r%k 



M-l 1-1 ( I-l p 

XfiiCG = H H\H11 

«i=0 7=0 [p=jk=j 



=0 \ s 



k-1 



(r 7 -r-E« m? n (-r ,y ) (u ;+ ir 



E atfAjK) (*j) mJ + E ^KtE S mjG s mj {(rjr-a m J(u J+1 ) 



(4.5) 



(4.6) 
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All operations involving the vectors u- can be removed from the original algorithm. The final 
result is given by adding xbjcg to the contributions of the seed system and the MR-part of the algo- 
rithm, Y, s Q^MR' which is computed trivially. For reference an implementation of the algorithms is 
provided online at http://sourceforge.net/projects/bicgstabell2p/. 

5. Conclusions 

We have presented an extension of the double-pass trick from Conjugate Gradient to the 
more general BiCGstab(f). While initially PFE methods looked inferior to the nested Krylov-Ritz 
method in the non-Hermitian case, our new (pseudo-)double-pass BiCGstab(f) is a method with 
similar performance. Our benchmarks concentrated on the overlap operator where pseudo-double- 
pass performs as well or even better than the nested Krylov-Ritz method on the tested architectures. 
Current supercomputers might have enough main memory such that pseudo-double-pass is feasi- 
ble, but this will depend on details of the simulation. Large values of £ yield less overhead in the 
double-pass methods compared to single-pass. This could boost the application of the algorithm 
in problems where / > 1 is crucial for convergence. We plan to investigate the efficiency of the 
double-pass BiCGstab(^) algorithms for other functions aside from the inverse square root. 

As a closing remark let us mention that a pseudo-double-pass method can also be used instead 
of the usual (double -pass) Conjugate Gradient method. This extension seems trivial, though we are 
not aware of any mention in the literature. 
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